solution program
GeoDANO: Geometric VLM with Domain Agnostic Vision Encoder
Cho, Seunghyuk, Qin, Zhenyue, Liu, Yang, Choi, Youngbin, Lee, Seungbeom, Kim, Dongwoo
We introduce GeoDANO, a geometric vision-language model (VLM) with a domain-agnostic vision encoder, for solving plane geometry problems. Although VLMs have been employed for solving geometry problems, their ability to recognize geometric features remains insufficiently analyzed. To address this gap, we propose a benchmark that evaluates the recognition of visual geometric features, including primitives such as dots and lines, and relations such as orthogonality. Our preliminary study shows that vision encoders often used in general-purpose VLMs, e.g., OpenCLIP, fail to detect these features and struggle to generalize across domains. To overcome this limitation, we develop GeoCLIP, a CLIP-based model trained on synthetic geometric diagram-caption pairs. Benchmark results show that GeoCLIP outperforms existing vision encoders in recognizing geometric features. We then propose our VLM, GeoDANO, which augments GeoCLIP with a domain adaptation strategy for unseen diagram styles. GeoDANO outperforms specialized methods for plane geometry problems and GPT-4o on MathVerse.
Fuse, Reason and Verify: Geometry Problem Solving with Parsed Clauses from Diagram
Zhang, Ming-Liang, Li, Zhong-Zhi, Yin, Fei, Lin, Liang, Liu, Cheng-Lin
Geometry problem solving (GPS) requires multi-modal understanding, multi-hop reasoning, and the application of theorem knowledge. In this paper, we propose a neural-symbolic model for plane geometry problem solving (PGPS), named PGPSNet-v2, with three key steps: modal fusion, reasoning process and knowledge verification. In modal fusion, we leverage textual clauses to express the fine-grained structural and semantic content of the geometry diagram, and fuse the diagram with the textual problem efficiently through structural-semantic pre-training. For reasoning, we design an explicable solution program to describe the geometric reasoning process, and employ a self-limited decoder to generate the solution program autoregressively. To reduce solution errors, we propose a multi-level theorem verifier that eliminates solutions that do not match geometric principles, alleviating the hallucination of the neural model. We also construct a large-scale geometry problem dataset called PGPS9K, containing fine-grained annotations of textual clauses, solution programs and the knowledge tuples involved. Extensive experiments on the Geometry3K and PGPS9K datasets show that our PGPSNet solver outperforms existing symbolic and neural solvers in GPS performance while maintaining good explainability and reliability, and that each solver component (fusion, reasoning, verification) is shown to be effective.
GOLD: Geometry Problem Solver with Natural Language Description
Zhang, Jiaxin, Moshfeghi, Yashar
Addressing the challenge of automated geometry math problem-solving in artificial intelligence (AI) involves understanding multi-modal information and mathematics. Current methods struggle with accurately interpreting geometry diagrams, which hinders effective problem-solving. To tackle this issue, we present the Geometry problem sOlver with natural Language Description (GOLD) model. GOLD enhances the extraction of geometric relations by separately processing symbols and geometric primitives within the diagram. Subsequently, it converts the extracted relations into natural language descriptions, efficiently utilizing large language models to solve geometry math problems. Experiments show that the GOLD model outperforms the Geoformer model, the previous best method on the UniGeo dataset, by achieving accuracy improvements of 12.7% and 42.1% in calculation and proving subsets. Additionally, it surpasses the former best model on the PGPS9K and Geometry3K datasets, PGPSNet, by obtaining accuracy enhancements of 1.8% and 3.2%, respectively.
GAPS: Geometry-Aware Problem Solver
Zhang, Jiaxin, Jiang, Yinghui, Moshfeghi, Yashar
Geometry problem solving presents a formidable challenge within the NLP community. Existing approaches often rely on models designed for solving math word problems, neglecting the unique characteristics of geometry math problems. Additionally, the current research predominantly focuses on geometry calculation problems, while overlooking other essential aspects like proving. In this study, we address these limitations by proposing the Geometry-Aware Problem Solver (GAPS) model. GAPS is specifically designed to generate solution programs for geometry math problems of various types with the help of its unique problem-type classifier. To achieve this, GAPS treats the solution program as a composition of operators and operands, segregating their generation processes. Furthermore, we introduce the geometry elements enhancement method, which enhances the ability of GAPS to recognize geometry elements accurately. By leveraging these improvements, GAPS showcases remarkable performance in resolving geometry math problems. Our experiments conducted on the UniGeo dataset demonstrate the superiority of GAPS over the state-of-the-art model, Geoformer. Specifically, GAPS achieves an accuracy improvement of more than 5.3% for calculation tasks and an impressive 41.1% for proving tasks. Notably, GAPS achieves an impressive accuracy of 97.5% on proving problems, representing a significant advancement in solving geometry proving tasks.
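The idea of a solution program as a composition of operators and operands can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the operator names, the `N`/`V` naming of operands, and the `run_program` helper are hypothetical and are not taken from GAPS, PGPSNet, or any dataset annotation scheme.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical operator table (not from any of the listed papers):
# each operator maps numeric operands to a single numeric result.
OPERATORS: Dict[str, Callable[..., float]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
    "pythagoras_hyp": lambda a, b: math.hypot(a, b),
}

# One step is an operator name followed by the names of its operands.
Step = Tuple[str, List[str]]


def run_program(steps: List[Step], given: Dict[str, float]) -> float:
    """Execute the steps in order; the result of step i is stored as V{i}."""
    if not steps:
        raise ValueError("empty solution program")
    env = dict(given)  # N0, N1, ... are the numbers given in the problem
    result = 0.0
    for i, (op, operands) in enumerate(steps):
        args = [env[name] for name in operands]
        result = OPERATORS[op](*args)
        env[f"V{i}"] = result  # later steps may reuse intermediate values
    return result


# Hypotenuse of a right triangle with legs N0 and N1, then scaled by N2.
program = [("pythagoras_hyp", ["N0", "N1"]), ("mul", ["V0", "N2"])]
answer = run_program(program, {"N0": 3.0, "N1": 4.0, "N2": 2.0})
print(answer)  # 10.0
```

Keeping operators and operands as separate fields mirrors the separation described in the abstract: a model could predict the operator sequence and the operand references with different components, while an interpreter like this one checks the resulting program by executing it.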
Deduplicating and Ranking Solution Programs for Suggesting Reference Solutions
Shirafuji, Atsushi, Watanobe, Yutaka
Referring to solution programs written by other users is helpful for learners in programming education. However, current online judge systems simply list all solution programs submitted by users, sorted by submission date and time, execution time, or user rating, ignoring how helpful the programs are as references. In addition, users struggle to find a variety of solution approaches because there are too many duplicated and near-duplicated programs. To motivate learners to refer to various solutions and learn better solution approaches, in this paper, we propose an approach to deduplicate and rank common solution programs for each programming problem. Based on the observation that a program with many duplicates adopts a more common approach and can serve as a general reference, we remove the near-duplicated solution programs and rank the unique programs by duplicate count. Experiments on solution programs submitted to a real-world online judge system demonstrate that the number of programs is reduced by 60.20% after deduplication, whereas the baseline reduces it by only 29.59%, meaning that users need to refer to only 39.80% of the programs on average. Furthermore, our analysis shows that the top-10 ranked programs cover 29.95% of programs on average, indicating that users can grasp 29.95% of solution approaches by referring to only 10 programs. The proposed approach shows the potential to reduce the learners' burden of referring to too many solutions and to motivate them to learn a variety of solution approaches.
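The deduplicate-then-rank idea can be sketched in a few lines. The abstract does not say how near-duplicates are detected, so the `normalize` function below is a crude stand-in chosen for illustration (it drops `#` comments and all whitespace); it is an assumption, not the authors' method, and the helper names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def normalize(source: str) -> str:
    """Crude canonical form (an assumption, not the paper's method):
    drop '#' comments and all whitespace. It would also mangle a '#'
    inside a string literal, which a real tokenizer would avoid."""
    no_comments = re.sub(r"#.*", "", source)
    return re.sub(r"\s+", "", no_comments)


def dedup_and_rank(programs: List[str]) -> List[Tuple[str, int]]:
    """Group programs by normalized form, keep the first submission of each
    group as its representative, and rank groups by duplicate count."""
    counts: Counter = Counter()
    representative: Dict[str, str] = {}
    for src in programs:
        key = normalize(src)
        counts[key] += 1
        representative.setdefault(key, src)
    return [(representative[key], n) for key, n in counts.most_common()]


submissions = [
    "print(sum(map(int, input().split())))",
    "print(sum(map(int,input().split())))  # short",
    "a, b = map(int, input().split())\nprint(a + b)",
]
ranked = dedup_and_rank(submissions)
reduction = 1 - len(ranked) / len(submissions)
for program, count in ranked:
    print(count, repr(program))
print(f"programs reduced by {reduction:.2%}")
```

In this toy run the first two submissions collapse into one group of size 2, which is ranked first, so a learner would see two programs instead of three.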
Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors
Phung, Tung, Pădurean, Victor-Alexandru, Cambronero, José, Gulwani, Sumit, Kohn, Tobias, Majumdar, Rupak, Singla, Adish, Soares, Gustavo
Generative AI and large language models hold great promise in enhancing computing education by powering next-generation educational technologies for introductory programming. Recent works have studied these models for different scenarios relevant to programming education; however, these works are limited for several reasons, as they typically consider already outdated models or only specific scenario(s). Consequently, there is a lack of a systematic study that benchmarks state-of-the-art models for a comprehensive set of programming education scenarios. In our work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human tutors for a variety of scenarios. We evaluate using five introductory Python programming problems and real-world buggy programs from an online platform, and assess performance using expert-based annotations. Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance for several scenarios. These results also highlight settings where GPT-4 still struggles, providing exciting future directions on developing techniques to improve the performance of these models.
A Multi-Modal Neural Geometric Solver with Textual Clauses Parsed from Diagram
Zhang, Ming-Liang, Yin, Fei, Liu, Cheng-Lin
Geometry problem solving (GPS) is a high-level mathematical reasoning task requiring the capacities of multi-modal fusion and geometric knowledge application. Recently, neural solvers have shown great potential in GPS but still fall short in diagram presentation and modal fusion. In this work, we convert diagrams into basic textual clauses to describe diagram features effectively, and propose a new neural solver called PGPSNet to fuse multi-modal information efficiently. Combining structural and semantic pre-training, data augmentation and self-limited decoding, PGPSNet is endowed with rich knowledge of geometry theorems and geometric representation, and therefore promotes geometric understanding and reasoning. In addition, to facilitate research on GPS, we build a new large-scale and finely annotated GPS dataset named PGPS9K, labeled with both fine-grained diagram annotations and interpretable solution programs. Experiments on PGPS9K and an existing dataset, Geometry3K, validate the superiority of our method over state-of-the-art neural solvers. Our code, dataset and appendix material are available at https://github.com/mingliangzhang2018/PGPS.
Glass-Box Program Synthesis: A Machine Learning Approach
Christakopoulou, Konstantina (University of Minnesota, Twin Cities) | Kalai, Adam Tauman (Microsoft Research, New England)
Recently proposed models which learn to write computer programs from data use either input/output examples or rich execution traces. Instead, we argue that a novel alternative is to use a glass-box scoring function, given as a program itself that can be directly inspected. Glass-box optimization covers a wide range of problems, from computing the greatest common divisor of two integers, to learning-to-learn problems. In this paper, we present an intelligent search system which learns, given the partial program and the glass-box problem, the probabilities over the space of programs. We empirically demonstrate that our informed search procedure leads to significant improvements compared to brute-force program search, both in terms of accuracy and time. For our experiments we use rich context free grammars inspired by number theory, text processing, and algebra. Our results show that (i) running our framework iteratively can considerably increase the number of problems solved, (ii) our framework can improve itself even in domain agnostic scenarios, and (iii) it can solve problems that would be otherwise too slow to solve with brute-force search.
Glass-Box Program Synthesis: A Machine Learning Approach
Christakopoulou, Konstantina, Kalai, Adam Tauman
Recently proposed models which learn to write computer programs from data use either input/output examples or rich execution traces. Instead, we argue that a novel alternative is to use a glass-box loss function, given as a program itself that can be directly inspected. Glass-box optimization covers a wide range of problems, from computing the greatest common divisor of two integers, to learning-to-learn problems. In this paper, we present an intelligent search system which learns, given the partial program and the glass-box problem, the probabilities over the space of programs. We empirically demonstrate that our informed search procedure leads to significant improvements compared to brute-force program search, both in terms of accuracy and time. For our experiments we use rich context free grammars inspired by number theory, text processing, and algebra. Our results show that (i) performing 4 rounds of our framework typically solves about 70% of the target problems, (ii) our framework can improve itself even in domain agnostic scenarios, and (iii) it can solve problems that would be otherwise too slow to solve with brute-force search.
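To make the glass-box setting concrete, the sketch below shows only the brute-force baseline that the abstract compares against, not the learned search the authors propose. The tiny grammar, the target property inside `glass_box_score`, and all function names are assumptions made for illustration.

```python
import itertools
from typing import Callable, Iterator, Optional


def expressions(depth: int) -> Iterator[str]:
    """Enumerate expressions of the toy grammar
    E -> x | 1 | 2 | (E + E) | (E * E) up to the given nesting depth."""
    yield from ("x", "1", "2")
    if depth > 0:
        subs = list(expressions(depth - 1))
        for left, right in itertools.product(subs, repeat=2):
            for op in ("+", "*"):
                yield f"({left} {op} {right})"


def glass_box_score(f: Callable[[int], int]) -> float:
    """The scoring function is itself a program the searcher can read:
    the fraction of inputs on which f(x) equals x*x + 1."""
    inputs = range(6)
    return sum(f(x) == x * x + 1 for x in inputs) / len(inputs)


def compile_expr(expr: str) -> Callable[[int], int]:
    # eval is applied only to strings generated by the grammar above.
    return eval(f"lambda x: {expr}")


def brute_force(max_depth: int) -> Optional[str]:
    """Try every expression in order of increasing depth and return the
    first one that achieves a perfect score, or None if none does."""
    for depth in range(max_depth + 1):
        for expr in expressions(depth):
            if glass_box_score(compile_expr(expr)) == 1.0:
                return expr
    return None


solution = brute_force(2)
print(solution)
```

The number of candidates grows quickly with depth in this enumeration, which is the cost that a learned distribution over programs, as described in the abstract, is meant to reduce by trying likely candidates first.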